141 research outputs found

    Longest Common Extensions in Sublinear Space

    Get PDF
    The longest common extension problem (LCE problem) is to construct a data structure for an input string TT of length nn that supports LCE(i,j)(i,j) queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions ii and jj in TT. This classic problem has a well-known solution that uses O(n)O(n) space and O(1)O(1) query time. In this paper we show that for any trade-off parameter 1≀τ≀n1 \leq \tau \leq n, the problem can be solved in O(nτ)O(\frac{n}{\tau}) space and O(τ)O(\tau) query time. This significantly improves the previously best known time-space trade-offs, and almost matches the best known time-space product lower bound.Comment: An extended abstract of this paper has been accepted to CPM 201

    Metatheory of actions: beyond consistency

    Get PDF
    Consistency check has been the only criterion for theory evaluation in logic-based approaches to reasoning about actions. This work goes beyond that and contributes to the metatheory of actions by investigating what other properties a good domain description in reasoning about actions should have. We state some metatheoretical postulates concerning this sore spot. When all postulates are satisfied together we have a modular action theory. Besides being easier to understand and more elaboration tolerant in McCarthy's sense, modular theories have interesting properties. We point out the problems that arise when the postulates about modularity are violated and propose algorithmic checks that can help the designer of an action theory to overcome them

    High-throughput marker discovery in melon using a self-designed oligo microarray

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Genetic maps constitute the basis of breeding programs for many agricultural organisms. The creation of these maps is dependent on marker discovery. Melon, among other crops, is still lagging in genomic resources, limiting the ability to discover new markers in a high-throughput fashion. One of the methods used to search for molecular markers is DNA hybridization to microarrays. Microarray hybridization of DNA from different accessions can reveal differences between them--single-feature polymorphisms (SFPs). These SFPs can be used as markers for breeding purposes, or they can be converted to conventional markers by sequencing. This method has been utilized in a few different plants to discover genetic variation, using Affymetrix arrays that exist for only a few organisms. We applied this approach with some modifications for marker discovery in melon.</p> <p>Results</p> <p>Using a custom-designed oligonucleotide microarray based on a partial EST collection of melon, we discovered 6184 putative SFPs between the parents of our mapping population. Validation by sequencing of 245 SFPs from the two parents showed a sensitivity of around 79%. Most SFPs (81%) contained single-nucleotide polymorphisms. Testing the SFPs on another mapping population of melon confirmed that many of them are conserved.</p> <p>Conclusion</p> <p>Thousands of new SFPs that can be used for genetic mapping and molecular-assisted breeding in melon were discovered using a custom-designed oligo microarray. A portion of these SFPs are conserved and can be used in different breeding populations. Although improvement of the discovery rate is still needed, this approach is applicable to many agricultural systems with limited genomic resources.</p

    Fingerprints in Compressed Strings

    Get PDF
    The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that answers fingerprint queries. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i,j]. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get O(logN) query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get O(log log N) query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time O(log N log l) and O(log l log log l + log log N) for SLPs and Linear SLPs, respectively. Here, l denotes the length of the LCE

    Longest Common Extensions in Trees

    Get PDF
    The longest common extension (LCE) of two indices in a string is the length of the longest identical substrings starting at these two indices. The LCE problem asks to preprocess a string into a compact data structure that supports fast LCE queries. In this paper we generalize the LCE problem to trees and suggest a few applications of LCE in trees to tries and XML databases. Given a labeled and rooted tree TT of size nn, the goal is to preprocess TT into a compact data structure that support the following LCE queries between subpaths and subtrees in TT. Let v1v_1, v2v_2, w1w_1, and w2w_2 be nodes of TT such that w1w_1 and w2w_2 are descendants of v1v_1 and v2v_2 respectively. \begin{itemize} \item \LCEPP(v_1, w_1, v_2, w_2): (path-path \LCE) return the longest common prefix of the paths v1⇝w1v_1 \leadsto w_1 and v2⇝w2v_2 \leadsto w_2. \item \LCEPT(v_1, w_1, v_2): (path-tree \LCE) return maximal path-path LCE of the path v1⇝w1v_1 \leadsto w_1 and any path from v2v_2 to a descendant leaf. \item \LCETT(v_1, v_2): (tree-tree \LCE) return a maximal path-path LCE of any pair of paths from v1v_1 and v2v_2 to descendant leaves. \end{itemize} We present the first non-trivial bounds for supporting these queries. For \LCEPP queries, we present a linear-space solution with O(log⁡∗n)O(\log^{*} n) query time. For \LCEPT queries, we present a linear-space solution with O((log⁥log⁥n)2)O((\log\log n)^{2}) query time, and complement this with a lower bound showing that any path-tree LCE structure of size O(n \polylog(n)) must necessarily use Ω(log⁥log⁥n)\Omega(\log\log n) time to answer queries. For \LCETT queries, we present a time-space trade-off, that given any parameter τ\tau, 1≀τ≀n1 \leq \tau \leq n, leads to an O(nτ)O(n\tau) space and O(n/τ)O(n/\tau) query-time solution. This is complemented with a reduction to the the set intersection problem implying that a fast linear space solution is not likely to exist

    EVEREST: automatic identification and classification of protein domains in all protein sequences

    Get PDF
    BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knoledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site
    • 

    corecore